Optimizing The Putbits
Table of contents - Introduction - The 'proc' - First optimization - Second optimization - The same in pmode - Get bits - Limiting the buffer - Improving this doc - Closing words - Contacting the author
Introduction
This article is aimed to a coder who has already implemented a compressor, knows how ASM works and how to optimize for Pentium. If you have never done any of those things, then better do them, and when it's done, read this.
Everybody who has even learned ASSEMBLER has also 'studied' the putpixel. A very optimized routine. Why? That's for three reasons: first is the way for understanding how the VRAM works, second it's used always in video routines, and third, everybody showed his putpixel.
Now think about the putbits. It's almost the most used routine from compressors. A lot of clocks cycles are spent in your putbits. Do you really think you can optimize you compressor/parser/... enough? I hope so, but a fast putbits is always needed, so here we talk about it.
I know the best for optimizing is pmode, but I still don't work on it (soon I'll do so ;-), so we will use real mode, and we will optimize for a plain Pentium (586), so specific optimizations for other procesors will not be put here, unless someone really feels that's necessary. And I don't think so. E-)
This doc isn't intended to be a plain source-code, so you better understand how everything works.
The 'proc'
The routine is simple: Every time it's called it needs a byte of source, the bits to put into the output buffer, and the number of bits to write.
Then the routine puts the bits, and if the byte is full (= we wrote 8 bits) we update the output buffer. For putting the bit in the right position we shift it and then 'or' the output byte.
Here it is the putbits that I use, at least till now: ;-)
;--------------------------------------------------------------------
; Putbits
; Input: cx-> number of bits to write bl->bits to be written
; Output: nothing
; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
;--------------------------------------------------------------------
put_bits:
test cx,cx ;I used that. Put it ONLY if you know
jz short @@ptb_fin ;that maybe you will try to write 0 bits
;but better avoid writing 0 bits ;-D
push ax es di ;Save only the registers that you know
;are currently used.
mov es,buf_com ;ds is faster than any other segment
;register. But it's needed for some data
@@ptb_more_bits_: ;in my 'main' loop OE-)
push cx ;We need cx for the shl/cl
xor ax,ax ;clear it
mov bh,ptb_byte ;get the byte to finally write
mov al,bl ;get the input byte (where the bits are)
shr al,1 ;put the first bit in the carry flag
adc ah,0 ;Then to ah. I know there's salc
;but I think it's an undocumented inst.
;from Intel's Pentium, so...
mov bl,al ;save the shifted byte
mov cl,ptb_totalbits ;how many bits we wrote
shl ah,cl ;Put the bit to its position
or bh,ah ;and then update the output byte
mov ptb_byte,bh ;save the output byte
;now let's see if we must update the output buffer
inc ptb_totalbits ;we wrote one bit
cmp ptb_totalbits,8 ;did we wrote the full byte?
jb short @@ptb_no_byte ;90h
mov ptb_totalbits,0 ;start again to write the bits
mov di,ptr_com ;the pointer to the input buffer
mov es:[di],bh ;the output byte
inc di ;next position
mov ptr_com,di ;save it for next time
inc bytes_com ;the number of written bits
mov ptb_byte,0 ;erase the output byte
@@ptb_no_byte:
pop cx ;the number of bits to write
loop short @@ptb_more_bits_ ;are we done?
pop di es ax ;restore the used registers
@@ptb_fin:
ret
Problems? Here? A lot. There's no pairing, never. The shl/cl takes 4 clocks, push&pop isn't very funny, of course everything can be worse, the data may be misaligned, or in the same dword-boundary...
Let's count the clocks: 6+(16*number of bits)+6(if we update out. buffer). This is approximately the clocks for that piece of code, of course assuming that all the data is in the level 1 cache (that's like assuming Bill Gates is documenting every secret of Windows ;-).
To write one bit and update the buffer, this will take 28 clocks. I tried it with a program that reads the counters, and it took 65 clocks!
Writing 8 bits, so writing the byte to the output buffer too: 252 clocks.
First optimization
We will try to do more pairing, and loading, and saving the data only when it starts and when ends:
;--------------------------------------------------------------------
; Putbits
; Input: dx-> number of bits to write al->bits to be written
; Output: nothing
; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
;--------------------------------------------------------------------
put_bits:
mov bh,ptb_byte
mov cl,ptb_totalbits
mov es,buf_com ;yes, still using es }E-)
mov di,ptr_com
push cx di ;this pairs
@@ptb_more_bits_: ;the 'main' loop
xor ah,ah ;1 u
shr al,1 ;2 u
adc ah,0 ;3 u
shl ah,cl ;4-7 u - the bits written
or bh,ah ;8 u
inc cl ;1 v
cmp cl,8 ;9 u
jb short @@ptb_no_byte ;9 v
xor cx,cx ;1 u - the counter of written bits
mov es:[di],bh ;2 u
inc di ;2 v - next byte
mov ptr_com,di ;3 u - save only if incremented
inc bytes_com ;3 v
xor bh,bh ;4 u - output byte
@@ptb_no_byte:
dec dx ;10 u - the counter
jnz short @@ptb_more_bits_ ;10 v
mov ptb_byte,bh
mov ptb_totalbits,cl
pop di cx ;this pairs too
ret
This version has been debugged at least twice or more. It works. And it's better than the first.
7+(10*number of bits)+4(if we update out. buffer)
It's ok, don't you think? We improved the pairing, and sure that writing for example 5 bits is far better than in the first attempt.
The real timing for this is 51 clocks for one bit. For 8 bits 145! We have reduced the number of clock cycles from 252 to 145. Quite good. E-)
The bit organization is the following:
0000 0000<- first bit
||--- second bit
|---- third bit ...
Second optimization
One of the fucking things in this loop is the 4-clock shl/cl. How can we avoid it? Easy, every time we put a new bit we shift the byte to the left. This implies two things: we only spend 1 clock with the shift, BUT the order of the bits in the byte is inverted, so:
first bit->0000 0000
second bit--|| |- last bit
third bit----|
If you choose this putbits, then you MUST change your getbits, but not a lot. Your first getbits shifted to the left. This one will shift to the right.
;--------------------------------------------------------------------
; Putbits
; Input: cl-> number of bits to write al->bits to be written
; Output: nothing
; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
;--------------------------------------------------------------------
put_bits:
push di
mov es,buf_com ;some initalizations
mov di,ptr_com
mov ah,ptb_byte
mov ch,ptb_totalbits
@@ptb_more_bits_: ;the 'main' loop
shr al,1 ;u 1 - put bit in the carry flag
adc ah,0 ;u 2 - put it in output byte
inc ch ;v 2 - the count of written bits
cmp ch,8 ;3 u
jb short @@ptb_no_byte ;3 v
xor ch,ch ;1 u - the counter of written bits
mov es:[di],ah ;2 u
inc di ;2 v
mov ptr_com,di ;3 u
inc bytes_com ;3 v
xor ah,ah ;4 u - clear output byte
@@ptb_no_byte:
shl ah,1 ;u 4 - put the bit in its position
dec cl ;5 u - the counter
jnz short @@ptb_more_bits_ ;5 v
mov ptb_byte,ah
mov ptb_totalbits,ch
pop di
ret
There are some things which have to be mentioned: Why is the 'shl ah,1' after the comparison and not just after the 'inc ch'? Easy, when you have written 8 bits, you shift again to the left, then bit #7 will go out. In that way they are only shifted if the byte isn't full. Once the updating of the buffer 'shl ah,1' works, but don't worry, the content of ah is 0, so there's no problem with it. As you see, the body of the loop is smaller and all the instructions are simple and only one clock. I will not care more for 'theory' timing, so here it's the real timing: One bit: 41 - Eight bits: 110 It is better than the previous version, and it's more than 50% faster than the first. Hey, I'm proud of it. OE-) I must say that the first was very poorly coded, so it wasn't a very difficult job. E-) The same in pmode Well, since I'm now learning pmode I decided to do the same version but in pmode. This is not really very different. Anyway here it goes:
;--------------------------------------------------------------------
; Putbits. Pmode.
; Input: cl-> number of bits to write al->bits to be written
; Output: nothing
; data: ptb_byte, ptb_totalbits, output buffer, bytes_com
;--------------------------------------------------------------------
put_bits:
mov edi,ptr_com ;I like pmode E-)
mov ah,ptb_byte
mov ch,ptb_totalbits
push edi
@@ptb_more_bits_: ;the 'main' loop
shr al,1 ;u 1 - put bit in the carry flag
adc ah,0 ;u 2 - put it in output byte
inc ch ;v 2 - the counter of written bits
cmp ch,8 ;3 u
jb short @@ptb_no_byte ;3 v
xor ch,ch ;1 u - the counter of written bits
mov [edi],ah ;2 v
inc edi ;3 u
inc bytes_com ;3 v - I reduced 2 clocks putting
mov ptr_com,edi ;3 u - <- this there
xor ah,ah ;4 u - clear output byte
@@ptb_no_byte:
shl ah,1 ;u 4 - put the bit in his position
dec cl ;5 u - the counter
jnz short @@ptb_more_bits_ ;5 v
mov ptb_byte,ah
mov ptb_totalbits,ch
pop edi
ret
This is, at least till someone publishes a better one, the best put_bits out there. Putting one bit: 39 clocks. Putting 8: 105 clocks.
Get bits
This document isn't a class of "How to cut'n'paste", this is a "Learn and implement" class, or at least I wanted it to be in that way, but optimization requires examples, so the source code is here. But there's still the get bits. Once you've learned all this stuff, write your own. If you get good results, show us your implementation.
Did you really think you would get no homework? X-D
Remember that if you use the last version for getbits you must use shr, and if you use one of the others, shl. Or something like that. OE-)
Limiting the buffer
Today I was planning the memory needed for my new compressor (really a lot E-). And as memory is limited, you have to choose well how you will distribute it. Let's think. A lot of memory for the input file. More memory for the data structures needed by the compressor... So we now have almost no free mem, and we still need memory for the output buffer.
And I came to this solution:
The output buffer will be 5k long or something like that, the put_bits will have a counter with all the bytes written in this buffer (this is not bytes_com), and when it's 5k (so the buffer is full) we write it to the output file, and start again writing in the 5k buffer.
This has not been tested yet, so maybe 5k is not enough and spends a lot of time writing.
Improving this doc
Of course this doc isn't finished, it still needs YOUR help. If you improve the put_bits, or do a good get_bits, send them to me, and a new version of that doc will be released. You will be properly credited. Let's do a good piece of optimization. Of course other ideas around that will be accepted, making the put_bits get the source bits from a word, etc. If you read something that doesn't make sense or something, let me know. If you have any idea, email me: dario_phong@mixmail.com
Closing words
My words have arrived at the end, at least till someone else wants to put his work in favour of everybody. Nothing else to say. Well, something else: If I don't get any feedback I will not update this doc. E-(
Today I've written one article and finished it, and I still haven't coded my new compressor in pmode. }E-)
Contacting the author
Contacting me is as simple as sending an email. E-)
I check it almost every week, so don't worry if my email is a little bit late. And check my page, and join my email list (I'll send you an email, when I've updated my page with new things to download).